As partners (Hailey Dusablon and Jordan Stein), we are interested in exploring socioeconomic factors that may be correlated to health issues, specifically focusing on COVID-19 vaccination rates and what factors have the greatest impact on these numbers. Since we are both interested in careers in the healthcare field, analyzing this type of data would be a great opportunity to gain some insight into how lifestyle factors may affect the health of populations. This type of research is important because lifestyle choices can be changed, so it is useful to know how our everyday actions can impact our health. The first dataset we are interested in analyzing is the COVID Data Tracker published by the CDC (which can be found here: https://covid.cdc.gov/covid-data-tracker/#datatracker-home). This dataset contains information related to the prevalence of covid cases, deaths, and trends at the national, state, and county levels in the United States. More specifically, this dataset contains information regarding the state-wide distribution of people who have received the COVID vaccine. One question that would be interesting to explore with this data is how socioeconomic factors, such as poverty and minority group distribution, affect the spread of covid and the vaccination rates in different states and counties. In order to do this, we will be using the social determinants of health categories laid out by the U.S. Department of Health and Human services (https://health.gov/healthypeople/objectives-and-data/social-determinants-health) in order to choose factors to research as potential variables that affect vaccination rates in a state.cWe would also like to see if the regions with lower rates of the covid vaccine also have lower rates of other types of common vaccines. Since there is a significant stigma surrounding the COVID vaccine, it would be interesting to see how the COVID vaccine rates compare to other vaccine rates. We are considering this research because COVID has been a very relevant problem for the past year and a half and the COVID vaccine has been very controversial. We have not seen a lot of covid research studying social factors such as religion, access to healthcare resources, and income.
To begin the project, we started by setting up a Github repository and ensuring that we were both listed as collaborators. Since the website was created with Hailey’s account, Jordan cloned the repository on her computer. After this website was established, we met on Zoom to discuss datasets we were interested in exploring and possible ideas about how we would analyze these datasets. For future collaboration, we are planning to meet twice a week on Tuesdays and Thursdays for two hours each day via Zoom or in-person and will be coordinating code through a private Github repository.
The biggest challenge we had was choosing what factors to research to achieve our goals. By choosing to focus on the social determinants of health, we were able to narrow down what data we were looking for. We were then able to find relevant data. A lot of the data sets that we found were so extensive that they were too big for Pandas to handle without crashing, so some data sets had to be edited down in Microsoft Excel in order for them to be viable for use. We also had issues in creating the dataframe with deciding how to tidy and reformat the data. We decided it would be best to drop the columns that we would not need for our research purposes. Lastly, we dropped all the columns about the booster vaccine since the data was only available at a national level (because the booster vaccine was only developed very recently). Another challenge in starting this project was learning how to navigate GitHub. It took a bit of trial-and-error to figure out how to organize our information on the site. Now, we are more familiar and comfortable with using GitHub.
A CDC analysis found that "counties with high social vulnerability had lower vaccination rates than counties with low social vulnerability." More information about this research can be found here: https://www.kff.org/coronavirus-covid-19/issue-brief/vaccination-is-local-covid-19-vaccination-rates-vary-by-county-and-key-characteristics/
The goal of this project is to use Data Science techniques to examine data sets about COVID vaccination coverage in the US and see how this data relates to socioeconomic status. The motivation for this investigation is to determine any significant correlations between various socioeconomic factors and COVID vaccination rates. The same source listed above states that "Ensuring access to COVID-19 vaccines for these communities can help address the disparate health effects of the virus and achieve herd immunity." This type of research can be helpful in identifying specific factors that may affect the decision-making processes for vaccine distribution.
Using available data, we will examine the social determinants of health one by one as we examine our data:

Healthy People 2030, U.S. Department of Health and Human Services, Office of Disease Prevention and Health Promotion. Retrieved from https://health.gov/healthypeople/objectives-and-data/social-determinants-health
In this analysis, the COVID vaccine rates for each location will be the dependent variables, and various social factors will serve as independent variables.
We will begin by loading in our CSV files and turning them into one coherent data frame. Since so many factors are being considered, it was impossible to find a single data frame from a single source. Thus, we have imported several CSVs from different sources to create a new data frame containing all of the factors we will be exploring.
#Import the necessary libraries:
!pip install plotly
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly
import plotly.express as px
from sklearn.metrics import pairwise_distances
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import linear_model
Collecting plotly
Downloading plotly-5.4.0-py2.py3-none-any.whl (25.3 MB)
|████████████████████████████████| 25.3 MB 15.0 MB/s eta 0:00:01
Collecting tenacity>=6.2.0
Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Requirement already satisfied: six in /opt/conda/lib/python3.9/site-packages (from plotly) (1.16.0)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.4.0 tenacity-8.0.1
# Import our CSV files to DataFrames:
vac_state = pd.read_csv("./data/vaccine_state.csv")
state_income_df = pd.read_csv("./data/StateIncome.csv")
school_rank = pd.read_csv("./data/School_Ratings.csv")
vaccine_cov = pd.read_csv("./data/Vaccination_Coverage.csv")
state_demo_df = pd.read_csv("data/state_demographics.csv")
unemployment_df = pd.read_csv("data/Unemployment_Rates.csv")
beds_df = pd.read_csv("./data/Beds_per_1000.csv")
politics_df = pd.read_csv("./data/politics.csv")
safety_df = pd.read_csv("./data/safety.csv")
transport_df = pd.read_csv("./data/transport.csv")
# Get rid of max rows & columns so whole dataframe can be examined:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Tidy the data so that all state names are consistent
vac_state = vac_state.set_index("State")
vac_state = vac_state.rename(index = {'New York State': "New York"})
vac_state = vac_state.drop(['United States', "District of Columbia"], axis = 0)
vac_state.head()
| Percent Pop with At Least One Dose | |
|---|---|
| State | |
| Alaska | 58.9 |
| Alabama | 54.0 |
| Arkansas | 57.5 |
| Arizona | 60.8 |
| California | 73.9 |
# add state codes to data frame because plotly only recognizes abbreviated state names
covid_vaccine_update = vac_state.reset_index()
state_codes = {
'Mississippi': 'MS', 'Oklahoma': 'OK',
'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR',
'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA',
'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ',
'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT',
'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT',
'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV',
'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND',
'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY',
'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH',
'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD',
'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA',
'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX',
'Nevada': 'NV', 'Maine': 'ME'}
covid_vaccine_update['state_code'] = covid_vaccine_update['State'].apply(lambda x : state_codes[x])
#Make some changes to our smaller dataframes to prepare for combining:
#the main idea is to merge all of this data on each state
#so that the final data frame contains socioeconomic data for each state
state_income_df = state_income_df.set_index("State")
state_income_df = state_income_df.rename(columns = {"HouseholdIncome" : "Median Household Income"})
school_rank = school_rank.drop(['qualityRank', 'safetyRank'], axis = 1)
school_rank = school_rank.set_index("State")
school_rank = school_rank.rename(columns = {"overallRank": "Public School Rankings"})
state_demo_df = state_demo_df.set_index("State")
state_demo_df = state_demo_df.drop(['District of Columbia'], axis = 0)
unemployment_df = unemployment_df.set_index("State")
unemployment_df["Unemployment_Rate"] = unemployment_df["Unemployment_Rate"]*100
beds_df = beds_df.rename(columns = {"Location" : "State"})
beds_df.set_index("State", inplace = True)
beds_df = beds_df.drop(columns = ["State/Local Government", "Non-Profit", "For-Profit"])
beds_df = beds_df.rename(columns = {"Total" : "Total Beds per 1000"})
politics_df.set_index("State", inplace = True)
safety_df.set_index("State", inplace = True)
transport_df.set_index("State", inplace = True)
# Join all of the cleaned data frames on the states!
combined_df = state_demo_df
combined_df = vac_state.join(combined_df, on = "State")
combined_df = combined_df.join(state_income_df, on = "State")
combined_df = combined_df.join(school_rank, on = "State")
combined_df = combined_df.join(unemployment_df, on = "State")
combined_df = combined_df.join(beds_df, on = "State")
combined_df = combined_df.join(politics_df, on = "State")
combined_df = combined_df.join(safety_df, on = "State")
combined_df = combined_df.join(transport_df, on = "State")
combined_df = combined_df.rename(columns = {"Median Houseold Income": "Median Household Income"})
combined_df = combined_df.rename(columns = {"Percent Under 65 Years Witout Health insurance": "Percent Under 65 Years Without Health insurance"})
# standadrize data
combined_df = (combined_df - combined_df.mean())/combined_df.std()
combined_df = combined_df.fillna(0)
combined_df.head()
| Percent Pop with At Least One Dose | Percent Under 18 Years | Percent 65 and Older | Percent White | HS Education or Higher | College Education or Higher | Percent Under 66 Years With a Disability | Percent Under 65 Years Without Health insurance | Persons Below Poverty Level | Population per Square Mile | Median Household Income | Public School Rankings | Unemployment_Rate | Total Beds per 1000 | voting_index | safetyScore | Percent_Spent_Transportation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | |||||||||||||||||
| Alaska | -0.546219 | 1.230784 | -2.322746 | -1.088759 | 1.208038 | -0.312800 | -0.154748 | 1.069686 | -0.757490 | -0.742123 | 1.458899 | 1.500649 | 0.243555 | -0.523658 | 0.467100 | -0.490177 | 4.184937 |
| Alabama | -1.090439 | -0.009301 | 0.172594 | -0.780237 | -1.213910 | -1.097670 | 1.359088 | 0.453636 | 1.243706 | -0.385160 | -1.175728 | 1.231475 | 1.225632 | 0.805942 | 1.018792 | -1.345503 | 0.186067 |
| Arkansas | -0.701711 | 0.507401 | 0.224581 | 0.023545 | -1.067125 | -1.576250 | 1.919768 | 0.229618 | 1.503120 | -0.532235 | -1.461411 | 1.096887 | -0.836730 | 0.689054 | 1.110741 | -1.524907 | -0.009788 |
| Arizona | -0.335195 | 0.145710 | 0.536498 | 0.315829 | -0.883644 | -0.331943 | -0.266884 | 0.985679 | 0.502523 | -0.531086 | -0.358336 | 1.567943 | 0.538178 | -0.918154 | -0.084593 | 0.114995 | -1.126115 |
| California | 1.119761 | 0.145710 | -1.127062 | -0.552904 | -2.278099 | 0.510357 | -1.388244 | -0.330426 | -0.127484 | 0.169052 | 1.225123 | 0.760419 | 0.341763 | -1.035042 | -1.647722 | -0.164218 | -1.153192 |
# check that all data types are numerical
combined_df.dtypes
Percent Pop with At Least One Dose float64 Percent Under 18 Years float64 Percent 65 and Older float64 Percent White float64 HS Education or Higher float64 College Education or Higher float64 Percent Under 66 Years With a Disability float64 Percent Under 65 Years Without Health insurance float64 Persons Below Poverty Level float64 Population per Square Mile float64 Median Household Income float64 Public School Rankings float64 Unemployment_Rate float64 Total Beds per 1000 float64 voting_index float64 safetyScore float64 Percent_Spent_Transportation float64 dtype: object
Our final, cleaned data frame contains socioeconomic information for each state. All of the data was standardized to ensure consistent scales. There is no missing data and all data types are numerical.
fig = px.choropleth(covid_vaccine_update, locations='state_code',
locationmode="USA-states", color= "Percent Pop with At Least One Dose", scope="usa")
fig.show()
This map above illustrates the percent of each state population that has received at least one dose of the COVID vaccine. Our assumption is that people that have received at least one vaccine will be more likely to get completely vaccinated in time. We will be using several predictive variables to try to recreate this map without using the actual vaccination data. This data is from our vac_state dataframe, which we downloaded from the CDC.
corr = combined_df.corr()
fig, ax = plt.subplots(figsize = (15,15))
sns.heatmap(corr, annot=True)
plt.title("Correlations Between Various Socioeconmoic Factors", fontsize = 20)
plt.show()
The purpose of the heatmap above is to address any confounding variables. Some variables, such as income and poverty, and education and income, are bound to be highly correlated (positively or negatively). While we will still be exploring some of these factors individually, it is important to keep in mind that many of these variables are not independent of each other.
We will begin by looking at education and comparing different education factors of a state to the percent of the state's population that has at least one dose of the vaccine. In this section, we will calculate the correlation between Percent of State Populations with High School Educations, College Educations, Public School Rankings, and Vaccine Rates.
hs_variables = ["HS Education or Higher", 'Percent Pop with At Least One Dose']
hs_corr = combined_df[hs_variables].corr()
college_variables = ["College Education or Higher", 'Percent Pop with At Least One Dose']
college_corr = combined_df[college_variables].corr()
school_variables = ["Public School Rankings", 'Percent Pop with At Least One Dose']
school_corr = combined_df[school_variables].corr()
display(hs_corr)
display(college_corr)
display(school_corr)
| HS Education or Higher | Percent Pop with At Least One Dose | |
|---|---|---|
| HS Education or Higher | 1.000000 | 0.147417 |
| Percent Pop with At Least One Dose | 0.147417 | 1.000000 |
| College Education or Higher | Percent Pop with At Least One Dose | |
|---|---|---|
| College Education or Higher | 1.000000 | 0.729796 |
| Percent Pop with At Least One Dose | 0.729796 | 1.000000 |
| Public School Rankings | Percent Pop with At Least One Dose | |
|---|---|---|
| Public School Rankings | 1.000000 | -0.478508 |
| Percent Pop with At Least One Dose | -0.478508 | 1.000000 |
# create scatterplot to visualize relationship between college education and percent of population that has been vaccinated
plt.rc('figure', figsize=(15,7))
combined_df.plot(kind = "scatter", x = 'College Education or Higher', y = 'Percent Pop with At Least One Dose',
color = 'red', title = "College Education vs. Vaccination Rates in Each State")
x = combined_df["College Education or Higher"]
y = combined_df['Percent Pop with At Least One Dose']
m, b = np.polyfit(x,y,1)
plt.plot(x, m*x+b)
for state, row in combined_df.iterrows():
plt.annotate(state,(x[state], y[state]), fontsize = 10)
There is a strong positive correlation between having a bachelor's degree or higher and the percent of the total population that has received at least one dose of a vaccine. The correlation is 0.729796, and the scatter plot shows only a strong positive correlation between these variables. This indicates that states with highly educated populations may be more likely to have a larger percent of their population vaccinated, despite there not being a strong correlation between high school education and the vaccination rates. Public School rankings also had a fairly significant correlation of -0.478508 to vaccine rates. This relationship is negative because in our data, a higher ranking has a lower number on the ranking scale.
We will now be examining economic stability factors, includung the most recent data for each state about median household income, unemployment rates, poverty rates, and individuals with disabilities. All of these are compared to the percentage of people in the state that have had at least one dose of the vaccine.
income_variables = ["Median Household Income", "Percent Pop with At Least One Dose"]
income_corr = combined_df[income_variables].corr()
employ_variables = ['Unemployment_Rate', "Percent Pop with At Least One Dose"]
employ_corr = combined_df[employ_variables].corr()
pov_variables = ["Persons Below Poverty Level", "Percent Pop with At Least One Dose"]
poverty_corr = combined_df[pov_variables].corr()
dis_variables = ["Percent Under 66 Years With a Disability", "Percent Pop with At Least One Dose"]
disabilities_corr = combined_df[dis_variables].corr()
display(income_corr)
display(employ_corr)
display(poverty_corr)
display(disabilities_corr)
| Median Household Income | Percent Pop with At Least One Dose | |
|---|---|---|
| Median Household Income | 1.000000 | 0.630108 |
| Percent Pop with At Least One Dose | 0.630108 | 1.000000 |
| Unemployment_Rate | Percent Pop with At Least One Dose | |
|---|---|---|
| Unemployment_Rate | 1.000000 | 0.101691 |
| Percent Pop with At Least One Dose | 0.101691 | 1.000000 |
| Persons Below Poverty Level | Percent Pop with At Least One Dose | |
|---|---|---|
| Persons Below Poverty Level | 1.000000 | -0.443451 |
| Percent Pop with At Least One Dose | -0.443451 | 1.000000 |
| Percent Under 66 Years With a Disability | Percent Pop with At Least One Dose | |
|---|---|---|
| Percent Under 66 Years With a Disability | 1.000000 | -0.392477 |
| Percent Pop with At Least One Dose | -0.392477 | 1.000000 |
plt.rc('figure', figsize=(15,7))
combined_df.plot(kind = "scatter", x = 'Median Household Income',
y = 'Percent Pop with At Least One Dose', color = "red")
x = combined_df["Median Household Income"]
y = combined_df["Percent Pop with At Least One Dose"]
m, b = np.polyfit(x,y,1)
plt.plot(x, m*x+b)
plt.title("Median Household Income vs. Vaccination Rates for Each State")
for state, row in combined_df.iterrows():
plt.annotate(state,(x[state], y[state]), fontsize = 10)
As expected, median household income has a significant correlation to vaccination rates. It is interesting to note on the scatterplot that many of the southern states (Louisiana, Arkansas, Mississippi, etc) tend to have lower median incomes, whereas several northern states (Massachusetts, Connecticut, New Hampshire) tend to have higher median incomes. This same pattern can also be seen in the college education plot from earlier. This is not surprising, since we already established that education and income were confounding variables. Persons Below Poverty Level also seems to impact covid vaccination rates, with a correlation of -0.443451, which means higher levels of poverty may be linked to lower vaccination rates. There were no strong correlations between individuals with disabilities and unemployment rates.
Another important health determinant is how much access people have to healthcare and if that access is of high or low quality. This was a difficult subject to obtain data for, since there is no concrete way to measure healthcare accessiblity. In this section, we will examine the percentage of people in a state without access to health insurance as well as how many hospital beds are available per 1000 people in each state.
ins_variables = ["Percent Under 65 Years Without Health insurance", "Percent Pop with At Least One Dose"]
insurance_corr = combined_df[ins_variables].corr()
beds_variables = ["Total Beds per 1000", "Percent Pop with At Least One Dose"]
beds_corr = combined_df[beds_variables].corr()
display(insurance_corr)
display(beds_corr)
| Percent Under 65 Years Without Health insurance | Percent Pop with At Least One Dose | |
|---|---|---|
| Percent Under 65 Years Without Health insurance | 1.000000 | -0.505554 |
| Percent Pop with At Least One Dose | -0.505554 | 1.000000 |
| Total Beds per 1000 | Percent Pop with At Least One Dose | |
|---|---|---|
| Total Beds per 1000 | 1.000000 | -0.533669 |
| Percent Pop with At Least One Dose | -0.533669 | 1.000000 |
plt.rc('figure', figsize=(15,7))
combined_df.plot(kind = "scatter", x = 'Percent Under 65 Years Without Health insurance',
y = 'Percent Pop with At Least One Dose', title = "Health Insurance vs. Vaccination Rates for Each State"
, color = "red")
x = combined_df["Percent Under 65 Years Without Health insurance"]
y = combined_df["Percent Pop with At Least One Dose"]
m, b = np.polyfit(x,y,1)
plt.plot(x, m*x+b)
for state, row in combined_df.iterrows():
plt.annotate(state,(x[state], y[state]), fontsize = 10)
There were notable correlations between uninsured individuals and the number of hospital beds. States with more insured people tend to have higher vaccination rates. Unexpectedly, states with less hospital beds had populations with higher vaccination coverage. This could mean that areas with less hospital beds have healthier populations, but it is also possible that the number of hospital beds is not the best way to measure access to healthcare resources.
Another important health determinant is the neighborhood and built environment context. In this category, we will look at the safety ranking of public schools in each state. We felt this metric would provide the most general measure in terms of community safety for each state, but in the future, it would also be interesting to explore crime and incarceration rates per state in relation to safety and vaccine coverage.
plt.rc('figure', figsize=(15,7))
combined_df.plot(kind = "scatter", x = 'safetyScore',
y = 'Percent Pop with At Least One Dose', title = "Public School Safety vs. Vaccination Rates for Each State"
,color = "red")
x = combined_df["safetyScore"]
y = combined_df["Percent Pop with At Least One Dose"]
m, b = np.polyfit(x,y,1)
plt.plot(x, m*x+b)
for state, row in state_demo_df.iterrows():
plt.annotate(state,(x[state], y[state]), fontsize = 10)
# calculate correlation for this relationship
variables = ["safetyScore", "Percent Pop with At Least One Dose"]
safety_corr = combined_df[variables].corr()
safety_corr
| safetyScore | Percent Pop with At Least One Dose | |
|---|---|---|
| safetyScore | 1.000000 | 0.479008 |
| Percent Pop with At Least One Dose | 0.479008 | 1.000000 |
Public school safety has a notable correlation of 0.479008 to COVID vaccination rates. Again, we see that many southern states rank lower in terms of safety, whereas many northern states are high-ranking in safety. This data tells indicates that safer communities may be linked to higher vaccination coverage.
The final determinant category we will be looking at is social and community factors. These will include race & ethnicity makeups, under 18 percentage, population density and political affiliation.
race_variables = ["Percent White", "Percent Pop with At Least One Dose"]
race_corr = combined_df[race_variables].corr()
kid_variables = ["Percent Under 18 Years", "Percent Pop with At Least One Dose"]
kid_corr = combined_df[kid_variables].corr()
adult_variables = ["Percent 65 and Older", "Percent Pop with At Least One Dose"]
adult_corr = combined_df[adult_variables].corr()
density_variables = ["Population per Square Mile", "Percent Pop with At Least One Dose"]
density_corr = combined_df[density_variables].corr()
pol_variables = ["voting_index", "Percent Pop with At Least One Dose"]
politics_corr = combined_df[pol_variables].corr()
display(race_corr)
display(kid_corr)
display(adult_corr)
display(density_corr)
display(politics_corr)
| Percent White | Percent Pop with At Least One Dose | |
|---|---|---|
| Percent White | 1.000000 | -0.143627 |
| Percent Pop with At Least One Dose | -0.143627 | 1.000000 |
| Percent Under 18 Years | Percent Pop with At Least One Dose | |
|---|---|---|
| Percent Under 18 Years | 1.000000 | -0.555766 |
| Percent Pop with At Least One Dose | -0.555766 | 1.000000 |
| Percent 65 and Older | Percent Pop with At Least One Dose | |
|---|---|---|
| Percent 65 and Older | 1.000000 | 0.237611 |
| Percent Pop with At Least One Dose | 0.237611 | 1.000000 |
| Population per Square Mile | Percent Pop with At Least One Dose | |
|---|---|---|
| Population per Square Mile | 1.000000 | 0.542737 |
| Percent Pop with At Least One Dose | 0.542737 | 1.000000 |
| voting_index | Percent Pop with At Least One Dose | |
|---|---|---|
| voting_index | 1.000000 | -0.853151 |
| Percent Pop with At Least One Dose | -0.853151 | 1.000000 |
plt.rc('figure', figsize=(15,7))
combined_df.plot(kind = "scatter", x = 'voting_index',
y = 'Percent Pop with At Least One Dose',
title = "Political Affiliation vs. Vaccination Rates for Each State",
color = "red")
x = combined_df["voting_index"]
y = combined_df["Percent Pop with At Least One Dose"]
m, b = np.polyfit(x,y,1)
plt.plot(x, m*x+b)
for state, row in state_demo_df.iterrows():
plt.annotate(state,(x[state], y[state]), fontsize = 10)
Political affiliation had a very high correlation of -0.853151 to vaccine rates. Population density and percent of population under 18 also had fairly strong correlations to our vaccine data. Surprisingly, older populations above 65 was not strongly correlated, which we did not expect because older populations are more at risk for COVID.
# create a bar graph that illustrates the correlations that were calculated above
correlation_plot = pd.DataFrame({"Correlation": [-0.853151, -0.555766, -0.533669, -0.505554, -0.478508,
-0.443451, -0.392477, -0.143627, 0.101691, .147417,
0.237611, 0.479008, 0.542737, 0.630108, 0.729796]},
index = ["Political Affiliation", "Age Under 18", "Hospital Beds",
"Uninsured", "School Rank", "Poverty", "Disability",
"Percent White", "Unemployment", "HS Education",
"Age Above 65", "Safety", "Pop Density", "Income", "College Education"])
correlation_plot.plot(kind="bar", grid = True)
plt.plot(0, 0.4, color = "red")
plt.title("Correlations for Socioeconomic Variables", fontsize = 20)
plt.xlabel("Calcuated Correlation")
plt.ylabel("Socioeconomic Factor")
Text(0, 0.5, 'Socioeconomic Factor')
The graph above provides a nice summary of the correlations found in our data exploration. To create analyzie our data more consicely, we will only be considering variables that had correlations greater than or equal to 0.4 and less than or equal to -0.4. Thus, the variables we will consider are political affiliation, age under 18, hospital beds, health insurance, public school rankings, poverty, safety, population density, income, and college education.
Using the correlations we have found between these variables and the vaccination rates of each state, we are able to build a general model that can recreate the ranking of each state (or rank other similar entities). The model we have developed is this:
The higher a vaccination score, the higher a state's vaccination rate is predicted to be. Variables can be removed from the equation for a less accurate model if that variable is unavailable for the entity the model is being used for. The model works best with entities that are similar to U.S. states.
The factors that we found to most strongly correlate to vaccination status include political affiliation, percent of population with at least college education, and income. Again, we are only considering variables that we found to have significant correlations with vaccination data.
# apply this formula to each value in our data frame to
# produce a score for each state
state_rank_df = pd.DataFrame()
for state, row in enumerate(combined_df):
score = (0.729796*(combined_df["College Education or Higher"])+
-0.478508*(combined_df["Public School Rankings"]) +
0.630108*(combined_df["Median Household Income"]) +
-0.443451*(combined_df["Persons Below Poverty Level"]) +
-0.505554*(combined_df["Percent Under 65 Years Without Health insurance"])+
-0.533669*(combined_df["Total Beds per 1000"]) +
0.479008*(combined_df["safetyScore"]) +
-0.555766*(combined_df["Percent Under 18 Years"]) +
0.542737*(combined_df["Population per Square Mile"]) +
-0.853151*(combined_df["voting_index"]))
state_rank_df["Score"] = score
state_rank_df["Standardized Percent"] = (vac_state["Percent Pop with At Least One Dose"] -
vac_state["Percent Pop with At Least One Dose"].mean())/vac_state["Percent Pop with At Least One Dose"].std()
state_rank_df = state_rank_df.sort_values(by = ["Score"])
state_rank_df.head()
| Score | Standardized Percent | |
|---|---|---|
| State | ||
| Mississippi | -7.952661 | -1.345889 |
| Oklahoma | -6.480638 | -0.568432 |
| Louisiana | -6.464135 | -1.145972 |
| Arkansas | -5.995361 | -0.701711 |
| West Virginia | -5.325036 | -1.667979 |
# Create a bar plot to visualize the distributions of our predicted scores
state_rank_plot = state_rank_df.reset_index()
state_rank_plot.plot.bar(x = "State", y = "Score", title = "Predicted Score Based on Formula", grid = True)
plt.ylabel("Predicted Score")
# Let's see how this data looks on a map!
# need to replace state names with state codes for plotly
state_rank_df = state_rank_df.reset_index()
state_rank_df['state_code'] = state_rank_df['State'].apply(lambda x : state_codes[x])
state_rank_df = state_rank_df.set_index("State")
fig = px.choropleth(state_rank_df, locations='state_code',
locationmode="USA-states", color= "Score", scope="usa")
fig.show()
Our graph and the scores for each state fit our predictions of what states would perform best with vaccination rates, with northeastern states performing very well and southern states being at the lower end, similar to our actual graph.
# calculate euclidian distances between percent vaccinated pop and covid scores
state_rank_df = state_rank_df.drop(columns = ["state_code"])
#from sklearn.metrics import pairwise_distances
distance = pairwise_distances(state_rank_df, metric="euclidean")
np.fill_diagonal(distance, float("Inf"), wrap=False)
l = []
for i, d in enumerate(distance):
#l.append(state_rank_df.iloc[[i]])
l.append(d.min())
state_rank_reset = state_rank_df.reset_index()
distances = pd.DataFrame(l, columns = ["Distance"])
state_dist = state_rank_reset.join(distances)
state_dist = state_dist.sort_values(by = ["Distance"])
state_dist.head()
| State | Score | Standardized Percent | Distance | |
|---|---|---|---|---|
| 29 | Utah | -0.042969 | -0.346302 | 0.170399 |
| 28 | Iowa | -0.078759 | -0.512900 | 0.170399 |
| 41 | Virginia | 4.218758 | 0.731032 | 0.177396 |
| 40 | Washington | 4.157812 | 0.564434 | 0.177396 |
| 27 | Michigan | -0.152560 | -0.679498 | 0.182213 |
state_dist.plot(x = "State", y = ["Score", "Standardized Percent", "Distance"],
kind = "bar", grid = True)
plt.ylabel("Vaccine Coverage (our scores and literature vaccine rates)")
plt.title("Vaccine Coverage vs. State Predictions")
Text(0.5, 1.0, 'Vaccine Coverage vs. State Predictions')
Because we created this formula based of of only 50 states, we were unable to create a model to predict covid coverage for the sates due to overfitting. However, we calculated the Euclidean distances between our COVID scores and the literature vaccine coverage values to see which states our calculations performed best for. Larger distances are equated to less accurate predictions. It should also be notes that because we are calculating scores based on our own system and not vaccine rates, our results are not meant to match perfectly, but we predict states with higher scores should have higher vaccination rates. In the plot above, you can see that states with higher scores tend to have higher numbers of vaccinated people. Some states such as Florida and New Mexico are very off, so clearly there are errors in our model.
We can also attempt to calculate the percent error of our predictive model. We cannot do this perfectly, as our predictions are not percent vaccinated but rather a score that represents the relative highness of a state's vaccination rate. In order to see how close we were, we can assume that we got one state correct and see how far off the other states are from that. Massachusetts is ranked #1 in vaccinations in both our predictions and the actual data, so we will scale our predictive scoring on the assumption that we get the Massachusetts percentage correct so that we may look at the percent error of the other 49 states relative to how close we were to the Massachusetts value. Our percentage spread and our score spread are different, at a ratio of 27.7:17, or 1:0.61. To fix this, we will need to take a few steps:
1) Add the lowest score +1 to everything to remove all negative numbers, starting our scoring system at 1.
2) Multiply all of our scores by 0.61 so our scores are on the same scale.
3) After doing all of this, our Massachusetts predicted score is 49.6 lower than the real Massachusetts score. We then added this number to all of our values, bringing the percent error of Massachusetts to 0 so that we can examine the other errors relative to this one.
4) Now we can calculate percent error of all other states relative to how close we were with our Massachusetts variable by using the standard percent error formula: (predicted val-actual val)/actual val * 100.
percent_err_df = pd.read_csv("./data/percent_v_score.csv")
percent_err_df = percent_err_df.set_index("State")
lowest = percent_err_df["Score"].min()
lowest = abs(lowest)
percent_err_df["Score"]=(percent_err_df["Score"]+lowest+1)
percent_err_df["Score"]=((percent_err_df["Score"]/0.61))
percent_err_df["Score"]=(percent_err_df["Score"]+(percent_err_df.loc["Massachusetts"]["Percent Pop with At Least One Dose"]-percent_err_df.loc["Massachusetts"].Score))
percent_err_df["Percent Error"] = (abs(percent_err_df["Percent Pop with At Least One Dose"]-percent_err_df["Score"])/percent_err_df["Percent Pop with At Least One Dose"])*100
percent_err_df.head()
| Percent Pop with At Least One Dose | Score | Percent Error | |
|---|---|---|---|
| State | |||
| Mississippi | 51.7 | 51.121820 | 1.118337 |
| Oklahoma | 58.7 | 53.534972 | 8.799025 |
| Louisiana | 53.5 | 53.562026 | 0.115937 |
| Arkansas | 57.5 | 54.330508 | 5.512160 |
| West Virginia | 48.8 | 55.429402 | 13.584839 |
percent_err_df["Percent Error"].mean()
6.551511007612933
Compared to how close we were with our Massachusetts estimate, our average percent error with our other estimates was 6.5%.
Now, we will use our scoring system to score each county based on the socioeconomic factors that we looked at at the state level. We will then attempt to see if we can create a model that will predict vaccine rates based on our COVID scores. Some of the variables we examined at the state level were unavailable or unfeasible at the county level, so this is based off of the variables we did have access to at this level.
# load in our county level data
county_stats = pd.read_csv("./data/countystats_2017.csv")
county_num = county_stats.drop(columns = ["fips", "name", "state"], axis=0)
county_num.head()
# standardize the numerical data
# need to drop categorical columns, standadrize numerical data, and then add
# categorical data back in
county_stats_std = (county_num-county_num.mean()) / county_num.std()
county_stats_std.head()
county_stats_std.insert(0, "County", county_stats["name"])
county_stats_std.insert(0, "State", county_stats["state"])
county_stats_std.insert(0, "fips", county_stats["fips"])
county_stats_std = county_stats_std.set_index("State")
county_stats_std.head()
| fips | County | pop2017 | age_under_5_2017 | age_over_65_2017 | median_age_2017 | black_2017 | native_2017 | asian_2017 | pac_isl_2017 | other_single_race_2017 | two_plus_races_2017 | hispanic_2017 | white_not_hispanic_2017 | hs_grad_2017 | bachelors_2017 | median_household_income_2017 | poverty_2017 | poverty_age_under_5_2017 | poverty_age_under_18_2017 | uninsured_2017 | unemployment_rate_2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | ||||||||||||||||||||||
| Alabama | 1001 | Autauga County | -0.144839 | -0.121716 | -0.808350 | -0.627219 | 0.693559 | -0.216833 | -0.144191 | -0.023637 | -0.194870 | -0.334235 | -0.470633 | -0.069117 | 0.232066 | 0.407585 | 0.422155 | -0.348214 | -0.628919 | -0.160803 | -0.451323 | -0.455762 |
| Alabama | 1003 | Baldwin County | 0.326730 | -0.121716 | 0.232939 | 0.267655 | 0.034882 | -0.145893 | -0.230359 | -0.132184 | -0.330491 | -0.354791 | -0.341574 | 0.312955 | 0.618112 | 1.021903 | 0.212660 | -0.639818 | -0.455288 | -0.557453 | -0.065996 | -0.376915 |
| Alabama | 1005 | Barbour County | -0.235578 | -0.287703 | -0.121542 | -0.272998 | 2.687504 | -0.230475 | -0.259081 | -0.132184 | 0.441502 | -0.776186 | -0.358344 | -1.549521 | -2.022447 | -0.993490 | -1.246889 | 1.723710 | 2.496442 | 2.248119 | 0.222998 | 0.781530 |
| Alabama | 1007 | Bibb County | -0.243388 | -0.121716 | -0.631109 | -0.254355 | 0.897501 | -0.208648 | -0.481681 | -0.132184 | -0.523489 | -0.765908 | -0.493965 | -0.109020 | -0.632679 | -0.864160 | -0.483732 | -0.118000 | -0.281656 | 0.409986 | -0.586187 | -0.134309 |
| Alabama | 1009 | Blount County | -0.137308 | 0.210260 | -0.121542 | -0.049280 | -0.513556 | -0.208648 | -0.431416 | -0.132184 | -0.340923 | -0.323957 | -0.008353 | 0.526935 | -0.987842 | -0.874937 | -0.178956 | -0.056610 | 0.341837 | 0.351939 | -0.027464 | -0.358720 |
county_stats_std = county_stats_std.fillna(0)
county_stats_std = county_stats_std.reset_index()
county_stats_std.head()
| State | fips | County | pop2017 | age_under_5_2017 | age_over_65_2017 | median_age_2017 | black_2017 | native_2017 | asian_2017 | pac_isl_2017 | other_single_race_2017 | two_plus_races_2017 | hispanic_2017 | white_not_hispanic_2017 | hs_grad_2017 | bachelors_2017 | median_household_income_2017 | poverty_2017 | poverty_age_under_5_2017 | poverty_age_under_18_2017 | uninsured_2017 | unemployment_rate_2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1001 | Autauga County | -0.144839 | -0.121716 | -0.808350 | -0.627219 | 0.693559 | -0.216833 | -0.144191 | -0.023637 | -0.194870 | -0.334235 | -0.470633 | -0.069117 | 0.232066 | 0.407585 | 0.422155 | -0.348214 | -0.628919 | -0.160803 | -0.451323 | -0.455762 |
| 1 | Alabama | 1003 | Baldwin County | 0.326730 | -0.121716 | 0.232939 | 0.267655 | 0.034882 | -0.145893 | -0.230359 | -0.132184 | -0.330491 | -0.354791 | -0.341574 | 0.312955 | 0.618112 | 1.021903 | 0.212660 | -0.639818 | -0.455288 | -0.557453 | -0.065996 | -0.376915 |
| 2 | Alabama | 1005 | Barbour County | -0.235578 | -0.287703 | -0.121542 | -0.272998 | 2.687504 | -0.230475 | -0.259081 | -0.132184 | 0.441502 | -0.776186 | -0.358344 | -1.549521 | -2.022447 | -0.993490 | -1.246889 | 1.723710 | 2.496442 | 2.248119 | 0.222998 | 0.781530 |
| 3 | Alabama | 1007 | Bibb County | -0.243388 | -0.121716 | -0.631109 | -0.254355 | 0.897501 | -0.208648 | -0.481681 | -0.132184 | -0.523489 | -0.765908 | -0.493965 | -0.109020 | -0.632679 | -0.864160 | -0.483732 | -0.118000 | -0.281656 | 0.409986 | -0.586187 | -0.134309 |
| 4 | Alabama | 1009 | Blount County | -0.137308 | 0.210260 | -0.121542 | -0.049280 | -0.513556 | -0.208648 | -0.431416 | -0.132184 | -0.340923 | -0.323957 | -0.008353 | 0.526935 | -0.987842 | -0.874937 | -0.178956 | -0.056610 | 0.341837 | 0.351939 | -0.027464 | -0.358720 |
# this data frame contains percent of population vaccinated for each U.S. county
county_vacc = pd.read_csv("./data/County_Vacc_all.csv")
county_vacc = county_vacc.drop_duplicates()
county_vacc_mean = county_vacc.groupby("County")["Percent_Vacc"].mean()
# again, we need to standardize the numerical data
county_num = county_vacc_mean.drop(columns = ["County"], axis=0)
county_vacc_std = (county_num - county_num.mean()) / county_num.std()
county_vacc_std.head()
County Abbeville County -0.732083 Acadia Parish 0.147037 Accomack County 1.426407 Ada County 0.454372 Adair County -0.580202 Name: Percent_Vacc, dtype: float64
# add the county level vaccination data to the demographics data
# by performing an inner merge on the counties
# and drop duplicate counties since we don't have a state column to group by state and county
county_vacc_all = county_stats_std.merge(county_vacc_std, on = "County").drop_duplicates()
#county_vacc_all = county_vacc_all.drop_duplicates("County")
county_vacc_all.head()
| State | fips | County | pop2017 | age_under_5_2017 | age_over_65_2017 | median_age_2017 | black_2017 | native_2017 | asian_2017 | pac_isl_2017 | other_single_race_2017 | two_plus_races_2017 | hispanic_2017 | white_not_hispanic_2017 | hs_grad_2017 | bachelors_2017 | median_household_income_2017 | poverty_2017 | poverty_age_under_5_2017 | poverty_age_under_18_2017 | uninsured_2017 | unemployment_rate_2017 | Percent_Vacc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1001 | Autauga County | -0.144839 | -0.121716 | -0.808350 | -0.627219 | 0.693559 | -0.216833 | -0.144191 | -0.023637 | -0.194870 | -0.334235 | -0.470633 | -0.069117 | 0.232066 | 0.407585 | 0.422155 | -0.348214 | -0.628919 | -0.160803 | -0.451323 | -0.455762 | -0.710641 |
| 1 | Alabama | 1003 | Baldwin County | 0.326730 | -0.121716 | 0.232939 | 0.267655 | 0.034882 | -0.145893 | -0.230359 | -0.132184 | -0.330491 | -0.354791 | -0.341574 | 0.312955 | 0.618112 | 1.021903 | 0.212660 | -0.639818 | -0.455288 | -0.557453 | -0.065996 | -0.376915 | -1.125185 |
| 2 | Georgia | 13009 | Baldwin County | -0.176646 | -0.785666 | -0.786195 | -1.130585 | 2.235525 | -0.222290 | 0.056866 | 0.084910 | -0.346139 | -0.303401 | -0.505632 | -1.194384 | -0.648121 | -0.195955 | -0.970096 | 1.815795 | 1.162639 | 1.474168 | 0.242265 | 0.805791 | -1.125185 |
| 3 | Alabama | 1005 | Barbour County | -0.235578 | -0.287703 | -0.121542 | -0.272998 | 2.687504 | -0.230475 | -0.259081 | -0.132184 | 0.441502 | -0.776186 | -0.358344 | -1.549521 | -2.022447 | -0.993490 | -1.246889 | 1.723710 | 2.496442 | 2.248119 | 0.222998 | 0.781530 | -0.642741 |
| 4 | West Virginia | 54001 | Barbour County | -0.261908 | -0.536685 | 0.144318 | -0.049280 | -0.559029 | -0.205919 | -0.237539 | -0.132184 | -0.507840 | -0.498682 | -0.600421 | 0.944919 | -0.292958 | -0.637832 | -0.931467 | 1.002373 | 2.235995 | 1.038821 | -0.586187 | 0.714814 | -0.642741 |
# based on the available data, we adapted our formula to create
# scores for each U.S. county
county_rank_df = pd.DataFrame()
for state, row in enumerate(county_vacc_all):
score = (0.729796*(county_vacc_all["bachelors_2017"])+
0.630108*(county_vacc_all["median_household_income_2017"]) +
-0.443451*(county_vacc_all["poverty_2017"]) +
-0.505554*(county_vacc_all["uninsured_2017"]))
county_vacc_all["Score"] = score
# to map data, all fips codes must have five digits
# if the fips code is missing a digit, that is because
# a 0 needs to be added as the first digit
county_vacc_all["fips"] = county_vacc_all["fips"].astype('int')
county_vacc_all["fips"]=county_vacc_all["fips"].apply(lambda x: '{0:0>5}'.format(x))
county_vacc_all = county_vacc_all.dropna()
county_vacc_all.head()
| State | fips | County | pop2017 | age_under_5_2017 | age_over_65_2017 | median_age_2017 | black_2017 | native_2017 | asian_2017 | pac_isl_2017 | other_single_race_2017 | two_plus_races_2017 | hispanic_2017 | white_not_hispanic_2017 | hs_grad_2017 | bachelors_2017 | median_household_income_2017 | poverty_2017 | poverty_age_under_5_2017 | poverty_age_under_18_2017 | uninsured_2017 | unemployment_rate_2017 | Percent_Vacc | Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 01001 | Autauga County | -0.144839 | -0.121716 | -0.808350 | -0.627219 | 0.693559 | -0.216833 | -0.144191 | -0.023637 | -0.194870 | -0.334235 | -0.470633 | -0.069117 | 0.232066 | 0.407585 | 0.422155 | -0.348214 | -0.628919 | -0.160803 | -0.451323 | -0.455762 | -0.710641 | 0.946042 |
| 1 | Alabama | 01003 | Baldwin County | 0.326730 | -0.121716 | 0.232939 | 0.267655 | 0.034882 | -0.145893 | -0.230359 | -0.132184 | -0.330491 | -0.354791 | -0.341574 | 0.312955 | 0.618112 | 1.021903 | 0.212660 | -0.639818 | -0.455288 | -0.557453 | -0.065996 | -0.376915 | -1.125185 | 1.196872 |
| 2 | Georgia | 13009 | Baldwin County | -0.176646 | -0.785666 | -0.786195 | -1.130585 | 2.235525 | -0.222290 | 0.056866 | 0.084910 | -0.346139 | -0.303401 | -0.505632 | -1.194384 | -0.648121 | -0.195955 | -0.970096 | 1.815795 | 1.162639 | 1.474168 | 0.242265 | 0.805791 | -1.125185 | -1.681967 |
| 3 | Alabama | 01005 | Barbour County | -0.235578 | -0.287703 | -0.121542 | -0.272998 | 2.687504 | -0.230475 | -0.259081 | -0.132184 | 0.441502 | -0.776186 | -0.358344 | -1.549521 | -2.022447 | -0.993490 | -1.246889 | 1.723710 | 2.496442 | 2.248119 | 0.222998 | 0.781530 | -0.642741 | -2.387838 |
| 4 | West Virginia | 54001 | Barbour County | -0.261908 | -0.536685 | 0.144318 | -0.049280 | -0.559029 | -0.205919 | -0.237539 | -0.132184 | -0.507840 | -0.498682 | -0.600421 | 0.944919 | -0.292958 | -0.637832 | -0.931467 | 1.002373 | 2.235995 | 1.038821 | -0.586187 | 0.714814 | -0.642741 | -1.200567 |
Now we have county data that we can analyze, yay! It should be noted that the most recent county level data that we could find for a lot of the features we wanted to analyze was from 2017. However, the COVID vaccination data is relatively new as of 2019-2021, so this will undoubtedly result in some error in our analysis.
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
counties = json.load(response)
# https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json
import plotly.figure_factory as ff
fips = county_vacc_all["fips"]
values = county_vacc_all["Score"]
fig = px.choropleth(county_vacc_all, geojson=counties, locations = "fips",
color= "Score", scope="usa")
fig.show()
The map above visualizes how each county ranked on our scoring system. The higher the score, the higher the expected vaccine coverage for a county.
The purpose of this model is to see how accurately our formula (adapted to the county-level data available to us) performs by using our formula to score each county based on their socioeconomic stauses, and then see how well these scores can predict vaccination rates
# create a linear regression model
# used inspiration from Eli and Matt's past final project
# https://mcatalano26.github.io/Data-Science-Final-Project/
# also learned some stuff from the tutorial linked below
# https://medium.com/@julie.yin/understanding-the-data-splitting-functions-in-scikit-learn-9ae4046fbd26
#county_vacc_all = county_vacc_all.drop_duplicates("County")
X = county_vacc_all[["Score"]]
y = county_vacc_all[["Percent_Vacc"]]
# split data into training and testing set
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X,
y, test_size=0.33, random_state=0)
# used help from this source below as well for generating a linear regression model
# https://stackoverflow.com/questions/29934083/linear-regression-on-pandas-dataframe-using-sklearn-indexerror-tuple-index-ou
# fit the model to the county data
# to make predictions about percent_vacc
model = sklearn.linear_model.LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X)
# add test and predicted values to data frames
y_df = pd.DataFrame(y_pred)
train_df = pd.DataFrame(y_test)
train_df = train_df.reset_index()
train_df = train_df.drop(columns = ["index"])
# calculate rmse
error = sklearn.model_selection.cross_val_score(model, X, y,
scoring="neg_root_mean_squared_error")
rmse = np.mean(error) * -1
rmse
0.7271006415267564
plt.scatter(X,y,color = 'red')
plt.plot(X, y_pred, color = "blue")
plt.xlabel("Our Calculated Score")
plt.ylabel("Predicted Vaccination Rates")
plt.title("Regression Line Model")
Text(0.5, 1.0, 'Regression Line Model')
# create final data frame comparing predicted Percent Vaccinations
# based on COVID Scores
y_df = pd.DataFrame(y_pred)
train_df
compare_df = y_df.join(train_df)
compare_df = compare_df.rename({0: "Predicted Percent Vacc"}, axis=1)
compare_df.head()
| Predicted Percent Vacc | Percent_Vacc | |
|---|---|---|
| 0 | 0.053109 | -0.474779 |
| 1 | 0.089906 | -0.224623 |
| 2 | -0.332413 | 0.023150 |
| 3 | -0.435963 | -3.305116 |
| 4 | -0.261793 | 0.475814 |
# plot predicted percent vaccination vs. actual percent vaccinated
compare_df_plot = compare_df[120:140]
compare_df_plot.plot(y=["Predicted Percent Vacc", "Percent_Vacc"], kind = "bar", grid = True)
plt.xlabel("Random County")
plt.ylabel("Percent of County Population Vaccinated")
plt.title("True vs. Predicted Vaccination Rates for Counties")
Text(0.5, 1.0, 'True vs. Predicted Vaccination Rates for Counties')
The bar plot provides a visual representation of the predicted vs. the actual vaccination rates. Clearly, there are some large margins of errors in our predictions, but as mentioned previously, there were less correlational factors to consider at the county level since less data was available.
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
counties = json.load(response)
# geojson file from:
# https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json
import plotly.figure_factory as ff
# merge our data frame with predicted values onto our initial df because we
# need fips codes for each county to plot the data
final_county_df = compare_df.merge(county_vacc_all, on = "Percent_Vacc")
fips = final_county_df["fips"]
values = final_county_df["Predicted Percent Vacc"]
fig = px.choropleth(final_county_df, geojson=counties, locations = "fips",
color= "Predicted Percent Vacc", scope="usa")
fig.show()
The map above visualizes the predicted vaccination rate of each county that our model predicted for available counties based on the score we calculated for each county. In essence, we attempted to build a model that could predict vaccination rates based on the scoring system we created from the state level data. Some counties are missing because since our county level data frame for vaccination rates did not have a state column, we had to drop duplicate counties since our data was based on the average vaccine rate for each county, we didn't want averages from counties over different states.
While our prediction model did not form particularly well in estimating vaccination rates for the states and counties, it was interesting to see how our state-level scoring system could be applied at the county level. It is interesting to see which factors have the largest correlations to vaccination rates and which states these findings hold true for. The factor we observed to have the largest impact on vaccination rates for states was political affiliation. It would be very interesting to see if this holds true in other countries across the world.
According to our predictions, the states that should have the highest vaccination rates are Massachusetts, Connecticut, Maryland, New Jersey, and Vermont. The actual states with the highest vaccination rates are Massachusetts, Vermont, Hawaii, Connecticut and Rhode Island. Our scoring system put Hawaii at number 8, Vermont at 6, and Rhode Island at 7, so even though our predicted ordering was not exactly perfect, we were rather close at predicting the general ranking of the states. The states with higher scores tend to have higher socioeconomic statuses, which is what we had initially predicted.
There are many places this research could be used in the future. States with lower vaccination rates can be more targeted in distribution campaigns, and if they are estimated to have a low vaccination rate because of specific variables, we can use that to have a more targeted approach. It would also be interesting to look at correlation numbers for other parts of the world and see if similar factors have a similar amount of impact on vaccination. We could also apply our prediction formula for the US to these other parts of the world and see how other countries would be vaccination wise if they were affected by the factors in the same way that the USA is. Machine learning could even be incorporated to see if factors that humans could not recognize could be discovered. There are a lot of weaknesses in this project due to the lack of available data and the inability to analyze every factor, but it was interesting to visualize our hypotheses and put them to the test!